DATA IMPORTING

On my local machine I have installed MySQL Server and MySQL Workbench, along with the MySQL connector for Python.

Next, I connect to the MySQL Server from Python and create a database.

We have successfully connected to the MySQL Server running on 127.0.0.1:3306.

I have created a database and successfully connected to it.
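As a sketch, the connect-and-create step can look like the following; the database name `telcom_churn` and the credentials are assumptions (not necessarily the ones used in this project), and `create_database()` should only be invoked while the server is running.

```python
CREATE_DB_SQL = "CREATE DATABASE IF NOT EXISTS telcom_churn"


def create_database(host="127.0.0.1", port=3306, user="root", password=""):
    """Connect to the local MySQL server and create the project database."""
    # Import inside the function so the sketch loads even where the
    # mysql-connector-python package is not installed.
    import mysql.connector

    conn = mysql.connector.connect(host=host, port=port, user=user, password=password)
    cursor = conn.cursor()
    cursor.execute(CREATE_DB_SQL)
    conn.commit()
    cursor.close()
    conn.close()


# create_database(password="...")  # call only when the server is up
```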

Using MySQL Workbench, I have created two tables: telcomcustomer-churn_1 and telcomcustomer-churn_2.

telcomcustomer-churn_1 holds part 1 of the dataset and telcomcustomer-churn_2 holds part 2.

The datasets have been loaded into the tables using the CSV import option.

telcomcustomer-churn_1 table snapshot ->

tab1.PNG

telcomcustomer-churn_2 table snapshot ->

tab2.PNG

Importing table 1

Importing table 2
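The import step boils down to one `pd.read_sql` per table. To keep this sketch self-contained and runnable, an in-memory SQLite database stands in for the MySQL server, and the hyphens in the table names are replaced with underscores (hyphenated names would need backtick quoting in MySQL); the toy columns are assumptions.

```python
import sqlite3

import pandas as pd

# Stand-in for the MySQL server: an in-memory SQLite database holding two
# small tables shaped like the real ones.  Real code would pass a
# mysql.connector connection to pd.read_sql instead.
conn = sqlite3.connect(":memory:")
part1 = pd.DataFrame({"customerID": ["0001", "0002"], "MonthlyCharges": [29.85, 56.95]})
part2 = pd.DataFrame({"customerID": ["0003", "0004"], "MonthlyCharges": [42.30, 70.70]})
part1.to_sql("telcomcustomer_churn_1", conn, index=False)
part2.to_sql("telcomcustomer_churn_2", conn, index=False)

# The import step itself: one SELECT per table.
df1 = pd.read_sql("SELECT * FROM telcomcustomer_churn_1", conn)
df2 = pd.read_sql("SELECT * FROM telcomcustomer_churn_2", conn)
```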

MERGING DATASETS
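Assuming the two tables hold disjoint row ranges with the same columns, the merge is a row-wise `pd.concat`; if the split were by columns instead, a `pd.merge` on `customerID` would be the tool. A minimal sketch with toy frames:

```python
import pandas as pd

# Toy stand-ins for the two imported tables.
df1 = pd.DataFrame({"customerID": ["0001", "0002"], "Churn": ["No", "Yes"]})
df2 = pd.DataFrame({"customerID": ["0003", "0004"], "Churn": ["No", "No"]})

# Stack part 1 on top of part 2 and rebuild a clean 0..n-1 index.
df = pd.concat([df1, df2], ignore_index=True)
```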

DATA CLEANSING
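One cleansing step typical for this dataset (whether or not it matches every step taken in the report): in the public Telco churn data, `TotalCharges` is commonly stored as text containing blank strings, so it must be coerced to numeric and the resulting gaps handled.

```python
import pandas as pd

# Toy frame mimicking the quirk: TotalCharges is text with a blank entry.
df = pd.DataFrame({
    "customerID": ["0001", "0002", "0003"],
    "TotalCharges": ["29.85", " ", "1889.5"],
})

# Coerce to numeric (blanks become NaN), then drop the unusable rows.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df = df.dropna(subset=["TotalCharges"]).reset_index(drop=True)
```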

DATA ANALYSIS & VISUALISATION

Monthly charges appear left-skewed, while Total Charges appears right-skewed; neither feature shows meaningful outliers.
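The skewness claim can be checked numerically with pandas' `Series.skew()`; the series below are synthetic stand-ins, not the actual charge columns.

```python
import pandas as pd

# A right-skewed series (long right tail) and a left-skewed one (long left tail).
right_skewed = pd.Series([10, 12, 11, 13, 12, 95, 120])
left_skewed = pd.Series([120, 118, 119, 117, 118, 30, 10])

print(right_skewed.skew())  # positive value -> right-skewed
print(left_skewed.skew())   # negative value -> left-skewed
```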

DATA PRE-PROCESSING

Random undersampling

Random oversampling
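Both resampling schemes above can be sketched with plain pandas (`imbalanced-learn`'s `RandomUnderSampler` / `RandomOverSampler` offer the same behaviour); the toy class ratio is an assumption.

```python
import pandas as pd

# Toy imbalanced data: 8 "No" vs 2 "Yes".
df = pd.DataFrame({"Churn": ["No"] * 8 + ["Yes"] * 2, "x": range(10)})
majority = df[df["Churn"] == "No"]
minority = df[df["Churn"] == "Yes"]

# Random undersampling: shrink the majority class down to the minority size.
under = pd.concat([majority.sample(len(minority), random_state=1), minority])

# Random oversampling: resample the minority class with replacement up to
# the majority size.
over = pd.concat([majority, minority.sample(len(majority), replace=True, random_state=1)])
```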

MODEL TRAINING, TESTING AND TUNING

Using the undersampled data

  1. BAGGING
  2. BOOSTING
  3. CHANGING HYPERPARAMETERS
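A minimal sketch of the bagging-vs-boosting comparison, using synthetic data in place of the resampled churn set; the model settings are assumptions, not the report's exact ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary-classification stand-in for the resampled churn data.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit one bagging and one boosting ensemble and compare test accuracy.
bag = BaggingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

print("bagging:", bag.score(X_te, y_te))
print("adaboost:", ada.score(X_te, y_te))
```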

Observations

  1. For the undersampled data, comparing the accuracy and R² scores, the AdaBoost classifier performs best.
  2. AdaBoost takes the fourth-shortest time to build the model and predict the values.
  3. Here the R² values for the algorithms do not turn out to be strongly negative.

Using the oversampled data

  1. BAGGING
  2. BOOSTING
  3. CHANGING HYPERPARAMETERS
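The hyperparameter-changing step can be sketched with `GridSearchCV`; the grid and estimator below are hypothetical, not the exact ones used in the report.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the oversampled churn data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical grid: try each combination with 3-fold cross-validation.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```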

Observations

  1. After comparing the accuracy scores and R² values, the Bagging ensemble appears to be the best fit.
  2. AdaBoost, Random Forest and XGBoost come quite close to the Bagging ensemble in this comparison.
  3. The Bagging ensemble gives good results but takes a fair amount of time to fit and predict compared to the other algorithms, excluding XGBoost, CatBoost and LightGBM.
  4. Here the R² values for the algorithms are not as strongly negative as with the undersampled data.
  5. LightGBM and the Gradient Boosting ensemble took considerably less time to construct the model than the other boosting algorithms.

Selecting best fit model

Finalizing the model

  1. Here, we see that the true-positive and true-negative counts are quite good.
  2. Only 74 customers who would churn are predicted not to churn (false negatives), which is small compared to the true positives and negatives and can be handled by the company.
  3. 279 customers who would not churn are predicted to churn (false positives), which gives the company a safety buffer.
  4. Since this is the oversampled data, the Bagging classifier works well: the number of 'Yes' samples was originally quite small and was oversampled, and bagging suits cases where the dataset has few distinct values but a large sample size, which matches our case.
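The TP/TN/FP/FN counts discussed above come straight out of `confusion_matrix`; a tiny illustrative example (not the report's actual 74/279 figures):

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = churn, 0 = no churn.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0]

# ravel() flattens the 2x2 matrix into the four counts in this fixed order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```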

Hence we go with the Bagging ensemble trained on the oversampled data for use in the GUI prediction.

GUI TO PREDICT CUSTOMER CHURN

ONE EXAMPLE OF INPUTS GIVEN TO GUI AND OUTPUT RECEIVED

ss1.PNG

CONCLUSION

  1. We have thus built a GUI predictor for the company to predict whether a customer will churn.
  2. The chosen model was a Bagging ensemble with 88% accuracy.
  3. The model was compared against various ensemble techniques and boosting algorithms.
  4. It took more time to train but proved more accurate than the others.
  5. The newer boosting algorithms proved fast and nearly as accurate as the Bagging ensemble.
  6. There was a class imbalance, which was fixed with oversampling; bagging then performed quite well.

Suggestions on dataset

  1. The class imbalance can bias the prediction, so it would help to have roughly equal numbers of positive and negative churn samples.
  2. Values such as multiple lines, online backup etc. are not self-explanatory to GUI users and could be explained more clearly.
  3. The number of data points for the 'No' churn class was sufficient.
  4. The number of features given was also good, and they contribute meaningfully to the churn class variable.